Enhancing Usability and Explainability of Data Systems
The recent growth of data science has expanded its reach to an ever-growing user base of nonexperts, increasing the need for usability, understandability, and explainability in these systems. Enhancing usability makes data systems accessible to people with different skills and backgrounds alike, leading to democratization of data systems. Furthermore, proper understanding of data and data-driven systems is necessary for the users to trust the function of the systems that learn from data. Finally, data systems should be transparent: when a data system behaves unexpectedly or malfunctions, the users deserve a proper explanation of what caused the observed incident. Unfortunately, most existing data systems offer limited usability and support for explanations: these systems are usable only by experts with sound technical skills, and even expert users are hindered by the lack of transparency into the systems' inner workings and functions. The aim of my thesis is to bridge the usability gap between nonexpert users and complex data systems, aid all sorts of users, including the expert ones, in data and system understanding, and provide explanations that help reason about unexpected outcomes involving data systems. Specifically, my thesis has the following three goals: (1) enhancing usability of data systems for nonexperts, (2) enabling data understanding that can assist users in a variety of tasks such as achieving trust in data-driven machine learning, gaining data understanding, and data cleaning, and (3) explaining causes of unexpected outcomes involving data and data systems.
To enhance usability, we focus on example-driven user intent discovery. We develop systems based on example-driven interactions in two different settings: querying relational databases and personalized document summarization. Towards data understanding, we develop a new data-profiling primitive that can characterize tuples for which a machine-learned model is likely to produce untrustworthy predictions. We also develop an explanation framework to explain causes of such untrustworthy predictions. Additionally, this new data-profiling primitive enables interactive data cleaning. Finally, we develop two explanation frameworks, tailored to provide explanations in debugging data system components, including the data itself. The explanation frameworks focus on explaining the root cause of a concurrent application's intermittent failure and exposing issues in the data that cause a data-driven system to malfunction.
Rapidash: Efficient Constraint Discovery via Rapid Verification
Denial Constraint (DC) is a well-established formalism that captures a wide
range of integrity constraints commonly encountered, including candidate keys,
functional dependencies, and ordering constraints, among others. Given their
significance, there has been considerable research interest in achieving fast
verification and discovery of exact DCs within the database community. Despite
the significant advancements in the field, prior work exhibits notable
limitations when confronted with large-scale datasets. The current
state-of-the-art exact DC verification algorithm demonstrates a quadratic
(worst-case) time complexity relative to the dataset's number of rows. In the
context of DC discovery, existing methodologies rely on a two-step algorithm
that commences with an expensive data structure-building phase, often requiring
hours to complete even for datasets containing only a few million rows.
Consequently, users are left without any insights into the DCs that hold on
their dataset until this lengthy building phase concludes. In this paper, we
introduce Rapidash, a comprehensive framework for DC verification and
discovery. Our work makes a dual contribution. First, we establish a connection
between orthogonal range search and DC verification. We introduce a novel exact
DC verification algorithm that demonstrates near-linear time complexity,
representing a theoretical improvement over prior work. Second, we propose an
anytime DC discovery algorithm that leverages our novel verification algorithm
to gradually provide DCs to users, eliminating the need for the time-intensive
building phase observed in prior work. To validate the effectiveness of our
algorithms, we conduct extensive evaluations on four large-scale production
datasets. Our results reveal that our DC verification algorithm achieves up to
40 times faster performance compared to state-of-the-art approaches.
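To make the verification problem concrete, the following is a minimal sketch of checking two simple denial constraints without enumerating all pairs of rows. It is not the Rapidash algorithm: it only covers a functional dependency (via hashing) and a two-column ordering constraint (via sort and sweep), which can be seen as a one-dimensional special case of a range query. The function names, the dictionary row format, and the column names are assumptions made for illustration.

```python
from itertools import groupby
from typing import Any, Dict, List

Row = Dict[str, Any]


def holds_functional_dependency(rows: List[Row], lhs: str, rhs: str) -> bool:
    """Check the DC: no two rows t, s with t[lhs] == s[lhs] and t[rhs] != s[rhs].

    One pass with a hash map instead of comparing every pair of rows.
    """
    seen: Dict[Any, Any] = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False  # same lhs value, different rhs value
        seen.setdefault(key, row[rhs])
    return True


def holds_order_constraint(rows: List[Row], x: str, y: str) -> bool:
    """Check the DC: no two rows t, s with t[x] > s[x] and t[y] < s[y].

    Sort by x, then sweep while tracking the largest y among rows with a
    strictly smaller x. Sorting dominates the cost: O(n log n).
    """
    ordered = sorted(rows, key=lambda r: r[x])
    max_y_so_far = None  # max y over all groups with strictly smaller x
    for _, group in groupby(ordered, key=lambda r: r[x]):
        ys = [r[y] for r in group]
        if max_y_so_far is not None and min(ys) < max_y_so_far:
            return False  # larger x paired with a smaller y
        group_max = max(ys)
        max_y_so_far = group_max if max_y_so_far is None else max(max_y_so_far, group_max)
    return True


# Hypothetical sample data for illustration only.
clean_rows = [
    {"zip": 1, "city": "A", "salary": 10, "tax": 1},
    {"zip": 1, "city": "A", "salary": 20, "tax": 2},
    {"zip": 2, "city": "B", "salary": 20, "tax": 3},
]
dirty_rows = clean_rows + [{"zip": 2, "city": "C", "salary": 30, "tax": 2}]
```

Rows with equal values in the sort column are grouped before comparison so that ties are not reported as violations of the strict inequality.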
Conversational Challenges in AI-Powered Data Science: Obstacles, Needs, and Design Opportunities
Large Language Models (LLMs) are being increasingly employed in data science
for tasks like data preprocessing and analytics. However, data scientists
encounter substantial obstacles when conversing with LLM-powered chatbots and
acting on their suggestions and answers. We conducted a mixed-methods study,
including contextual observations, semi-structured interviews (n=14), and a
survey (n=114), to identify these challenges. Our findings highlight key issues
faced by data scientists, including contextual data retrieval, formulating
prompts for complex tasks, adapting generated code to local environments, and
refining prompts iteratively. Based on these insights, we propose actionable
design recommendations, such as data brushing to support context selection, and
inquisitive feedback loops to improve communications with AI-based assistants
in data-science tools.
Comment: 24 pages, 8 figures
DataPrism: Exposing Disconnect between Data and Systems
Peer reviewed
As data is a central component of many modern systems, the cause of a system malfunction may reside in the data, and, specifically, in particular properties of the data. For example, a health-monitoring system that is designed under the assumption that weight is reported in lbs will malfunction when encountering weight reported in kilograms. Like software debugging, which aims to find bugs in the source code or runtime conditions, our goal is to debug data to identify potential sources of disconnect between the assumptions about some data and systems that operate on that data. We propose DataPrism, a framework to identify data properties (profiles) that are the root causes of performance degradation or failure of a data-driven system. Such identification is necessary to repair data and resolve the disconnect between data and systems. Our technique is based on causal reasoning through interventions: when a system malfunctions for a dataset, DataPrism alters the data profiles and observes changes in the system's behavior due to the alteration. Unlike statistical observational analysis that reports mere correlations, DataPrism reports causally verified root causes, in terms of data profiles, of the system malfunction. We empirically evaluate DataPrism on seven real-world and several synthetic data-driven systems that fail on certain datasets due to a diverse set of reasons. In all cases, DataPrism identifies the root causes precisely while requiring orders of magnitude fewer interventions than prior techniques.
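The intervention idea can be illustrated with a small sketch built on the abstract's weight-unit example. This is a naive one-intervention-at-a-time loop, not DataPrism's actual search strategy, and the toy system, the intervention functions, the threshold, and all names are hypothetical.

```python
from typing import Callable, Dict, List

Dataset = List[Dict[str, float]]


def malfunction_score(data: Dataset) -> float:
    """Toy system that assumes weight is in lbs and flags adults under 90.

    Returns the fraction of flagged records; a high value means the
    system is misbehaving on this dataset.
    """
    flagged = sum(1 for r in data if r["weight"] < 90)
    return flagged / len(data)


def convert_kg_to_lbs(data: Dataset) -> Dataset:
    """Intervention on the 'weight unit' profile."""
    return [{**r, "weight": r["weight"] * 2.20462} for r in data]


def clip_heights(data: Dataset) -> Dataset:
    """Intervention on a 'height range' profile (irrelevant to the toy system)."""
    return [{**r, "height": min(r["height"], 200.0)} for r in data]


def find_root_causes(
    data: Dataset,
    system: Callable[[Dataset], float],
    interventions: Dict[str, Callable[[Dataset], Dataset]],
    threshold: float,
) -> List[str]:
    """Return the profiles whose alteration brings the malfunction score
    down to the threshold or below, given that the baseline exceeds it."""
    if system(data) <= threshold:
        return []  # nothing to explain: the system behaves acceptably
    return [
        name
        for name, intervene in interventions.items()
        if system(intervene(data)) <= threshold
    ]


# Hypothetical dataset with weight recorded in kilograms.
records = [
    {"weight": 60.0, "height": 165.0},
    {"weight": 72.0, "height": 180.0},
    {"weight": 85.0, "height": 175.0},
    {"weight": 95.0, "height": 190.0},
]
candidates = {"weight_unit": convert_kg_to_lbs, "height_range": clip_heights}
causes = find_root_causes(records, malfunction_score, candidates, threshold=0.1)
```

Because each candidate is confirmed by re-running the system on altered data, a profile that merely correlates with the failure but does not cause it (here, the height range) is not reported.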
Increasing frailty is associated with higher prevalence and reduced recognition of delirium in older hospitalised inpatients: results of a multi-centre study
Purpose:
Delirium is a neuropsychiatric disorder delineated by an acute change in cognition, attention, and consciousness. It is common, particularly in older adults, but poorly recognised. Frailty is the accumulation of deficits conferring an increased risk of adverse outcomes. We set out to determine how severity of frailty, as measured using the Clinical Frailty Scale (CFS), affected delirium rates and recognition in hospitalised older people in the United Kingdom.
Methods:
Adults over 65 years were included in an observational multi-centre audit across UK hospitals, two prospective rounds, and one retrospective note review. Clinical Frailty Scale (CFS), delirium status, and 30-day outcomes were recorded.
Results:
The overall prevalence of delirium was 16.3% (483). Patients with delirium were more frail than patients without delirium (median CFS 6 vs 4). The risk of delirium was greater with increasing frailty [OR 2.9 (1.8–4.6) in CFS 4 vs 1–3; OR 12.4 (6.2–24.5) in CFS 8 vs 1–3]. Higher CFS was associated with reduced recognition of delirium (OR of 0.7 (0.3–1.9) in CFS 4 compared to 0.2 (0.1–0.7) in CFS 8). These risks were both independent of age and dementia.
Conclusion:
We have demonstrated an incremental increase in risk of delirium with increasing frailty. This has important clinical implications, suggesting that frailty may provide a more nuanced measure of vulnerability to delirium and poor outcomes. However, the most frail patients are least likely to have their delirium diagnosed, and there is a significant lack of research into the underlying pathophysiology of both of these common geriatric syndromes.
Reducing the environmental impact of surgery on a global scale: systematic review and co-prioritization with healthcare workers in 132 countries
Abstract
Background
Healthcare cannot achieve net-zero carbon without addressing operating theatres. The aim of this study was to prioritize feasible interventions to reduce the environmental impact of operating theatres.
Methods
This study adopted a four-phase Delphi consensus co-prioritization methodology. In phase 1, a systematic review of published interventions and global consultation of perioperative healthcare professionals were used to longlist interventions. In phase 2, iterative thematic analysis consolidated comparable interventions into a shortlist. In phase 3, the shortlist was co-prioritized based on patient and clinician views on acceptability, feasibility, and safety. In phase 4, ranked lists of interventions were presented by their relevance to high-income countries and low–middle-income countries.
Results
In phase 1, 43 interventions were identified, which had low uptake in practice according to 3042 professionals globally. In phase 2, a shortlist of 15 intervention domains was generated. In phase 3, interventions were deemed acceptable for more than 90 per cent of patients except for reducing general anaesthesia (84 per cent) and re-sterilization of 'single-use' consumables (86 per cent). In phase 4, the top three shortlisted interventions for high-income countries were: introducing recycling; reducing use of anaesthetic gases; and appropriate clinical waste processing. In phase 4, the top three shortlisted interventions for low–middle-income countries were: introducing reusable surgical devices; reducing use of consumables; and reducing the use of general anaesthesia.
Conclusion
This is a step toward environmentally sustainable operating environments with actionable interventions applicable to both high- and low–middle-income countries.